This notebook is a template containing each step you need to complete for the project.
Please fill in your code where there are explicit ? markers in the notebook. You are welcome to add more cells and code as you see fit.
Once you have completed all the code implementations, please export your notebook as an HTML file so the reviewers can view your code. Make sure all cell outputs are rendered correctly.
File-> Export Notebook As... -> Export Notebook as HTML
There is a writeup to complete as well after all code implementation is done. Please answer all questions and attach the necessary tables and charts. You can complete the writeup in either Markdown or PDF.
Completing the code template and writeup template will cover all of the rubric points for this project.
The rubric contains "Stand Out Suggestions" for enhancing the project beyond the minimum requirements. The stand out suggestions are optional. If you decide to pursue the "stand out suggestions", you can include the code in this notebook and also discuss the results in the writeup file.
Below is an example of the steps to get the API username and key. Each student will have their own username and key.
Download the `kaggle.json` file and use the username and key from it.
ml.t3.medium instance (2 vCPU + 4 GiB)
Python 3 (MXNet 1.8 Python 3.7 CPU Optimized)

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0" bokeh==2.0.1
!pip install autogluon --no-cache-dir
# Without --no-cache-dir, smaller AWS instances may have trouble installing
# create the .kaggle directory and an empty kaggle.json file
!mkdir -p /root/.kaggle
!touch /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json
# Fill in your username and key from creating the Kaggle account and API token file
import json
kaggle_username = "honan80"
kaggle_key = "aa918219df6be42fe90e5407d289d6f6"
# Save the API token to the kaggle.json file
with open("/root/.kaggle/kaggle.json", "w") as f:
    f.write(json.dumps({"username": kaggle_username, "key": kaggle_key}))
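Hard-coding the key in a notebook that gets exported to HTML leaks the credential. As a minimal sketch of an alternative, the credentials could be read from environment variables instead (the kaggle CLI itself also honours `KAGGLE_USERNAME` / `KAGGLE_KEY`); `build_kaggle_token` is a hypothetical helper, not part of the project template:

```python
import json
import os

def build_kaggle_token() -> str:
    """Build the kaggle.json payload from environment variables,
    falling back to placeholders when they are unset."""
    username = os.environ.get("KAGGLE_USERNAME", "<your-username>")
    key = os.environ.get("KAGGLE_KEY", "<your-key>")
    return json.dumps({"username": username, "key": key})
```

The returned string can then be written to `/root/.kaggle/kaggle.json` exactly as above, without the key ever appearing in the notebook source.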
# Download the dataset, it will be in a .zip file so you'll need to unzip it as well.
!kaggle competitions download -c bike-sharing-demand
# If you already downloaded it you can use the -o command to overwrite the file
!unzip -o bike-sharing-demand.zip
Downloading bike-sharing-demand.zip to /root/aws_machine_learning_eng_p1
100%|████████████████████████████████████████| 189k/189k [00:00<00:00, 4.23MB/s]
Archive:  bike-sharing-demand.zip
  inflating: sampleSubmission.csv
  inflating: test.csv
  inflating: train.csv
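If the shell `unzip` tool is not available on the instance, the same extraction can be sketched in pure Python with the standard-library `zipfile` module (`extract_archive` is a hypothetical helper; like `unzip -o`, `extractall` overwrites existing files):

```python
import zipfile

def extract_archive(path: str, dest: str = ".") -> list:
    """Extract a zip archive to dest and return the member names."""
    with zipfile.ZipFile(path) as zf:
        zf.extractall(dest)
        return zf.namelist()
```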
import pandas as pd
from autogluon.tabular import TabularPredictor
import autogluon
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-2-a10a08039be2> in <module>
      1 import pandas as pd
----> 2 from autogluon.tabular import TabularPredictor
      3
      4 import autogluon

ModuleNotFoundError: No module named 'autogluon'
# Create the train dataset in pandas by reading the csv
# Parse the datetime column so you can use some of the `dt` features in pandas later
train = pd.read_csv('train.csv', parse_dates=['datetime']).drop(columns=['casual', 'registered'])
train.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 16 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 40 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 32 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 13 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 1 |
# Simple summary of the train dataset to view the min/max/variation of the dataset features.
train.describe()
| | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count |
|---|---|---|---|---|---|---|---|---|---|
| count | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.00000 | 10886.000000 | 10886.000000 | 10886.000000 | 10886.000000 |
| mean | 2.506614 | 0.028569 | 0.680875 | 1.418427 | 20.23086 | 23.655084 | 61.886460 | 12.799395 | 191.574132 |
| std | 1.116174 | 0.166599 | 0.466159 | 0.633839 | 7.79159 | 8.474601 | 19.245033 | 8.164537 | 181.144454 |
| min | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.82000 | 0.760000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 0.000000 | 1.000000 | 13.94000 | 16.665000 | 47.000000 | 7.001500 | 42.000000 |
| 50% | 3.000000 | 0.000000 | 1.000000 | 1.000000 | 20.50000 | 24.240000 | 62.000000 | 12.998000 | 145.000000 |
| 75% | 4.000000 | 0.000000 | 1.000000 | 2.000000 | 26.24000 | 31.060000 | 77.000000 | 16.997900 | 284.000000 |
| max | 4.000000 | 1.000000 | 1.000000 | 4.000000 | 41.00000 | 45.455000 | 100.000000 | 56.996900 | 977.000000 |
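Beyond `describe()`, it can help to confirm there are no missing values before training. A minimal sketch of such a check (`columns_with_nulls` is a hypothetical helper; on this dataset it should return an empty list):

```python
import pandas as pd

def columns_with_nulls(df: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one missing value."""
    null_counts = df.isnull().sum()
    return sorted(null_counts[null_counts > 0].index)
```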
# Create the test dataframe by reading the csv, remember to parse the datetime!
test = pd.read_csv('test.csv', parse_dates=['datetime'])
test.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-20 00:00:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 |
| 1 | 2011-01-20 01:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 2 | 2011-01-20 02:00:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 |
| 3 | 2011-01-20 03:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
| 4 | 2011-01-20 04:00:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 |
# Read the sample submission file the same way as the train and test datasets
submission = pd.read_csv('sampleSubmission.csv')
submission.head()
| | datetime | count |
|---|---|---|
| 0 | 2011-01-20 00:00:00 | 0 |
| 1 | 2011-01-20 01:00:00 | 0 |
| 2 | 2011-01-20 02:00:00 | 0 |
| 3 | 2011-01-20 03:00:00 | 0 |
| 4 | 2011-01-20 04:00:00 | 0 |
Requirements:

- We are predicting `count`, so it is the label we are setting.
- Ignore the `casual` and `registered` columns as they are also not present in the test dataset.
- Use `root_mean_squared_error` as the metric to use for evaluation.
- Use `best_quality` to focus on creating the best model.

%%time
predictor = TabularPredictor(label='count').fit(train_data=train,
time_limit=600,
presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20220404_145049/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20220404_145049/"
AutoGluon Version: 0.4.0
Python Version: 3.7.10
Operating System: Linux
Train Data Rows: 10886
Train Data Columns: 9
Label Column: count
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
Label info (max, min, mean, stddev): (977, 1, 191.57413, 181.14445)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 2989.22 MB
Train Data (Original) Memory Usage: 1.52 MB (0.1% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting DatetimeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 5 | ['season', 'holiday', 'workingday', 'weather', 'humidity']
('object', ['datetime_as_object']) : 1 | ['datetime']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 3 | ['season', 'weather', 'humidity']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.2s = Fit runtime
9 features in original data used to generate 13 features in processed data.
Train Data (Processed) Memory Usage: 0.98 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.22s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.75s of the 599.77s of remaining time.
-101.5462 = Validation score (root_mean_squared_error)
0.05s = Training runtime
0.1s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 399.29s of the 599.31s of remaining time.
-84.1251 = Validation score (root_mean_squared_error)
0.03s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 398.88s of the 598.9s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
2022-04-04 14:50:55,455 WARNING services.py:1758 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.96gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
-131.4609 = Validation score (root_mean_squared_error)
62.21s = Training runtime
7.96s = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 324.97s of the 524.99s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-131.0542 = Validation score (root_mean_squared_error)
27.14s = Training runtime
1.39s = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 293.91s of the 493.93s of remaining time.
-116.6217 = Validation score (root_mean_squared_error)
9.53s = Training runtime
0.43s = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 281.14s of the 481.16s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-130.5088 = Validation score (root_mean_squared_error)
204.26s = Training runtime
0.12s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 72.99s of the 273.01s of remaining time.
-124.6372 = Validation score (root_mean_squared_error)
4.2s = Training runtime
0.44s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 65.52s of the 265.54s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-137.1305 = Validation score (root_mean_squared_error)
73.23s = Training runtime
0.5s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 189.49s of remaining time.
-84.1251 = Validation score (root_mean_squared_error)
0.76s = Training runtime
0.0s = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 188.61s of the 188.58s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-60.5195 = Validation score (root_mean_squared_error)
49.85s = Training runtime
3.43s = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 135.28s of the 135.25s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-55.1292 = Validation score (root_mean_squared_error)
21.6s = Training runtime
0.22s = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 110.65s of the 110.63s of remaining time.
-53.4373 = Validation score (root_mean_squared_error)
25.71s = Training runtime
0.5s = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 81.86s of the 81.83s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-55.4437 = Validation score (root_mean_squared_error)
71.62s = Training runtime
0.12s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L2 ... Training model for up to 7.46s of the 7.44s of remaining time.
-53.7632 = Validation score (root_mean_squared_error)
7.15s = Training runtime
0.5s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -3.13s of remaining time.
-52.8272 = Validation score (root_mean_squared_error)
0.44s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 603.89s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220404_145049/")
CPU times: user 1min 42s, sys: 4.08 s, total: 1min 46s
Wall time: 10min 4s
predictor.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -52.827227 12.391484 507.170171 0.001100 0.441237 3 True 15
1 RandomForestMSE_BAG_L2 -53.437266 11.547101 406.354276 0.503127 25.711218 2 True 12
2 ExtraTreesMSE_BAG_L2 -53.763188 11.542974 387.793828 0.499001 7.150769 2 True 14
3 LightGBM_BAG_L2 -55.129204 11.265891 402.244157 0.221917 21.601099 2 True 11
4 CatBoost_BAG_L2 -55.443670 11.166340 452.265849 0.122366 71.622790 2 True 13
5 LightGBMXT_BAG_L2 -60.519452 14.477339 430.495622 3.433365 49.852563 2 True 10
6 KNeighborsDist_BAG_L1 -84.125061 0.104941 0.031038 0.104941 0.031038 1 True 2
7 WeightedEnsemble_L2 -84.125061 0.106165 0.793359 0.001224 0.762320 2 True 9
8 KNeighborsUnif_BAG_L1 -101.546199 0.102910 0.046686 0.102910 0.046686 1 True 1
9 RandomForestMSE_BAG_L1 -116.621736 0.433718 9.530072 0.433718 9.530072 1 True 5
10 ExtraTreesMSE_BAG_L1 -124.637158 0.437882 4.201691 0.437882 4.201691 1 True 7
11 CatBoost_BAG_L1 -130.508827 0.116350 204.257171 0.116350 204.257171 1 True 6
12 LightGBM_BAG_L1 -131.054162 1.391823 27.141194 1.391823 27.141194 1 True 4
13 LightGBMXT_BAG_L1 -131.460909 7.959073 62.209049 7.959073 62.209049 1 True 3
14 NeuralNetFastAI_BAG_L1 -137.130485 0.497275 73.226158 0.497275 73.226158 1 True 8
Number of models trained: 15
Types of models trained:
{'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'WeightedEnsembleModel', 'StackerEnsembleModel_RF', 'StackerEnsembleModel_KNN', 'StackerEnsembleModel_NNFastAiTabular', 'StackerEnsembleModel_XT'}
Bagging used: True (with 8 folds)
Multi-layer stack-ensembling used: True (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 3 | ['season', 'weather', 'humidity']
('int', ['bool']) : 2 | ['holiday', 'workingday']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20220404_145049/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L2': 'StackerEnsembleModel_XT',
'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
'KNeighborsDist_BAG_L1': -84.12506123181602,
'LightGBMXT_BAG_L1': -131.46090891834504,
'LightGBM_BAG_L1': -131.054161598899,
'RandomForestMSE_BAG_L1': -116.62173601727898,
'CatBoost_BAG_L1': -130.508827011041,
'ExtraTreesMSE_BAG_L1': -124.63715787314163,
'NeuralNetFastAI_BAG_L1': -137.13048499081756,
'WeightedEnsemble_L2': -84.12506123181602,
'LightGBMXT_BAG_L2': -60.51945154448997,
'LightGBM_BAG_L2': -55.129204049834215,
'RandomForestMSE_BAG_L2': -53.43726583024862,
'CatBoost_BAG_L2': -55.44367036077745,
'ExtraTreesMSE_BAG_L2': -53.76318784861675,
'WeightedEnsemble_L3': -52.827227053236385},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/KNeighborsUnif_BAG_L1/',
'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/KNeighborsDist_BAG_L1/',
'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/LightGBMXT_BAG_L1/',
'LightGBM_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/LightGBM_BAG_L1/',
'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/RandomForestMSE_BAG_L1/',
'CatBoost_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/CatBoost_BAG_L1/',
'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/ExtraTreesMSE_BAG_L1/',
'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20220404_145049/models/NeuralNetFastAI_BAG_L1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20220404_145049/models/WeightedEnsemble_L2/',
'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20220404_145049/models/LightGBMXT_BAG_L2/',
'LightGBM_BAG_L2': 'AutogluonModels/ag-20220404_145049/models/LightGBM_BAG_L2/',
'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20220404_145049/models/RandomForestMSE_BAG_L2/',
'CatBoost_BAG_L2': 'AutogluonModels/ag-20220404_145049/models/CatBoost_BAG_L2/',
'ExtraTreesMSE_BAG_L2': 'AutogluonModels/ag-20220404_145049/models/ExtraTreesMSE_BAG_L2/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20220404_145049/models/WeightedEnsemble_L3/'},
'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.04668617248535156,
'KNeighborsDist_BAG_L1': 0.031038284301757812,
'LightGBMXT_BAG_L1': 62.209049224853516,
'LightGBM_BAG_L1': 27.141193628311157,
'RandomForestMSE_BAG_L1': 9.53007197380066,
'CatBoost_BAG_L1': 204.25717067718506,
'ExtraTreesMSE_BAG_L1': 4.201690673828125,
'NeuralNetFastAI_BAG_L1': 73.22615814208984,
'WeightedEnsemble_L2': 0.7623202800750732,
'LightGBMXT_BAG_L2': 49.85256338119507,
'LightGBM_BAG_L2': 21.60109853744507,
'RandomForestMSE_BAG_L2': 25.711217641830444,
'CatBoost_BAG_L2': 71.62278985977173,
'ExtraTreesMSE_BAG_L2': 7.150769472122192,
'WeightedEnsemble_L3': 0.4412369728088379},
'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10291028022766113,
'KNeighborsDist_BAG_L1': 0.10494136810302734,
'LightGBMXT_BAG_L1': 7.959073066711426,
'LightGBM_BAG_L1': 1.3918232917785645,
'RandomForestMSE_BAG_L1': 0.4337184429168701,
'CatBoost_BAG_L1': 0.11635041236877441,
'ExtraTreesMSE_BAG_L1': 0.4378821849822998,
'NeuralNetFastAI_BAG_L1': 0.49727487564086914,
'WeightedEnsemble_L2': 0.0012235641479492188,
'LightGBMXT_BAG_L2': 3.4333646297454834,
'LightGBM_BAG_L2': 0.22191691398620605,
'RandomForestMSE_BAG_L2': 0.5031266212463379,
'CatBoost_BAG_L2': 0.12236642837524414,
'ExtraTreesMSE_BAG_L2': 0.49900054931640625,
'WeightedEnsemble_L3': 0.0011000633239746094},
'num_bag_folds': 8,
'max_stack_level': 3,
'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'KNeighborsDist_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'LightGBMXT_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBMXT_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -52.827227 12.391484 507.170171
1 RandomForestMSE_BAG_L2 -53.437266 11.547101 406.354276
2 ExtraTreesMSE_BAG_L2 -53.763188 11.542974 387.793828
3 LightGBM_BAG_L2 -55.129204 11.265891 402.244157
4 CatBoost_BAG_L2 -55.443670 11.166340 452.265849
5 LightGBMXT_BAG_L2 -60.519452 14.477339 430.495622
6 KNeighborsDist_BAG_L1 -84.125061 0.104941 0.031038
7 WeightedEnsemble_L2 -84.125061 0.106165 0.793359
8 KNeighborsUnif_BAG_L1 -101.546199 0.102910 0.046686
9 RandomForestMSE_BAG_L1 -116.621736 0.433718 9.530072
10 ExtraTreesMSE_BAG_L1 -124.637158 0.437882 4.201691
11 CatBoost_BAG_L1 -130.508827 0.116350 204.257171
12 LightGBM_BAG_L1 -131.054162 1.391823 27.141194
13 LightGBMXT_BAG_L1 -131.460909 7.959073 62.209049
14 NeuralNetFastAI_BAG_L1 -137.130485 0.497275 73.226158
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.001100 0.441237 3 True
1 0.503127 25.711218 2 True
2 0.499001 7.150769 2 True
3 0.221917 21.601099 2 True
4 0.122366 71.622790 2 True
5 3.433365 49.852563 2 True
6 0.104941 0.031038 1 True
7 0.001224 0.762320 2 True
8 0.102910 0.046686 1 True
9 0.433718 9.530072 1 True
10 0.437882 4.201691 1 True
11 0.116350 204.257171 1 True
12 1.391823 27.141194 1 True
13 7.959073 62.209049 1 True
14 0.497275 73.226158 1 True
fit_order
0 15
1 12
2 14
3 11
4 13
5 10
6 2
7 9
8 1
9 5
10 7
11 6
12 4
13 3
14 8 }
predictor.leaderboard()
| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L3 | -52.827227 | 12.391484 | 507.170171 | 0.001100 | 0.441237 | 3 | True | 15 |
| 1 | RandomForestMSE_BAG_L2 | -53.437266 | 11.547101 | 406.354276 | 0.503127 | 25.711218 | 2 | True | 12 |
| 2 | ExtraTreesMSE_BAG_L2 | -53.763188 | 11.542974 | 387.793828 | 0.499001 | 7.150769 | 2 | True | 14 |
| 3 | LightGBM_BAG_L2 | -55.129204 | 11.265891 | 402.244157 | 0.221917 | 21.601099 | 2 | True | 11 |
| 4 | CatBoost_BAG_L2 | -55.443670 | 11.166340 | 452.265849 | 0.122366 | 71.622790 | 2 | True | 13 |
| 5 | LightGBMXT_BAG_L2 | -60.519452 | 14.477339 | 430.495622 | 3.433365 | 49.852563 | 2 | True | 10 |
| 6 | KNeighborsDist_BAG_L1 | -84.125061 | 0.104941 | 0.031038 | 0.104941 | 0.031038 | 1 | True | 2 |
| 7 | WeightedEnsemble_L2 | -84.125061 | 0.106165 | 0.793359 | 0.001224 | 0.762320 | 2 | True | 9 |
| 8 | KNeighborsUnif_BAG_L1 | -101.546199 | 0.102910 | 0.046686 | 0.102910 | 0.046686 | 1 | True | 1 |
| 9 | RandomForestMSE_BAG_L1 | -116.621736 | 0.433718 | 9.530072 | 0.433718 | 9.530072 | 1 | True | 5 |
| 10 | ExtraTreesMSE_BAG_L1 | -124.637158 | 0.437882 | 4.201691 | 0.437882 | 4.201691 | 1 | True | 7 |
| 11 | CatBoost_BAG_L1 | -130.508827 | 0.116350 | 204.257171 | 0.116350 | 204.257171 | 1 | True | 6 |
| 12 | LightGBM_BAG_L1 | -131.054162 | 1.391823 | 27.141194 | 1.391823 | 27.141194 | 1 | True | 4 |
| 13 | LightGBMXT_BAG_L1 | -131.460909 | 7.959073 | 62.209049 | 7.959073 | 62.209049 | 1 | True | 3 |
| 14 | NeuralNetFastAI_BAG_L1 | -137.130485 | 0.497275 | 73.226158 | 0.497275 | 73.226158 | 1 | True | 8 |
predictions = predictor.predict(test)
predictions.head()
0    23.449482
1    41.501904
2    47.097668
3    49.068890
4    52.053726
Name: count, dtype: float32
# Describe the `predictions` series to see if there are any negative values
predictions.describe()
count    6493.000000
mean      100.946999
std        90.252510
min         3.013057
25%        20.820877
50%        63.965191
75%       169.403290
max       362.239502
Name: count, dtype: float64
# How many negative values do we have?
predictions.where(predictions < 0).dropna()
Series([], Name: count, dtype: float32)
import numpy as np

# Clip any negative predictions to zero - not needed here, but kept as a safeguard
predictions = predictions.clip(lower=0)
submission["count"] = predictions
submission.to_csv("submission_no_features.csv", index=False)
# I work on a work laptop and have no permissions to save json to the root directory,
# so all submissions were done on the website
!kaggle competitions submit -c bike-sharing-demand -f submission_no_features.csv -m "final-no-features"
100%|████████████████████████████████████████| 243k/243k [00:02<00:00, 93.6kB/s]
Successfully submitted to Bike Sharing Demand
My Submissions
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
| fileName | date | description | status | publicScore | privateScore |
|---|---|---|---|---|---|
| submission_no_features.csv | 2022-04-04 15:05:39 | final-no-features | complete | 1.80380 | 1.80380 |
| submission_new_hpo.csv | 2022-04-01 15:23:20 | Trained with hyper parameter optimisation | complete | 0.52941 | 0.52941 |
| submission_new_features.csv | 2022-03-26 17:21:27 | None | complete | 0.74591 | 0.74591 |
| submission_no_features.csv | 2022-03-26 16:10:44 | None | complete | 1.85306 | 1.85306 |
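Note that Kaggle scores this competition with root mean squared logarithmic error (RMSLE), not plain RMSE, which is one reason negative predictions must be removed (the log of a negative value is undefined). A minimal sketch of the metric:

```python
import numpy as np

def rmsle(y_true, y_pred) -> float:
    """Root mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
```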
1.80380

# Create a histogram of all features to show the distribution of each one relative to the data.
# This is part of the exploratory data analysis
train.hist()
array([[<AxesSubplot:title={'center':'season'}>,
<AxesSubplot:title={'center':'holiday'}>,
<AxesSubplot:title={'center':'workingday'}>],
[<AxesSubplot:title={'center':'weather'}>,
<AxesSubplot:title={'center':'temp'}>,
<AxesSubplot:title={'center':'atemp'}>],
[<AxesSubplot:title={'center':'humidity'}>,
<AxesSubplot:title={'center':'windspeed'}>,
<AxesSubplot:title={'center':'count'}>]], dtype=object)
# Convert the datetime column to pandas datetime format so we can parse date parts
train['datetime'] = pd.to_datetime(train['datetime'])
test['datetime'] = pd.to_datetime(test['datetime'])

# Create new features from the datetime parts
train['year'] = train['datetime'].dt.year
test['year'] = test['datetime'].dt.year
train['month'] = train['datetime'].dt.month
test['month'] = test['datetime'].dt.month
train['day'] = train['datetime'].dt.day
test['day'] = test['datetime'].dt.day
train['hour'] = train['datetime'].dt.hour
test['hour'] = test['datetime'].dt.hour

# Mark season and weather as categorical so they are treated as categories, not integers
train["season"] = train["season"].astype('category')
train["weather"] = train["weather"].astype('category')
test["season"] = test["season"].astype('category')
test["weather"] = test["weather"].astype('category')
# View our new features
train.head()
| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count | year | month | day | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 16 | 2011 | 1 | 1 | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 40 | 2011 | 1 | 1 | 1 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 32 | 2011 | 1 | 1 | 2 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 13 | 2011 | 1 | 1 | 3 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 1 | 2011 | 1 | 1 | 4 |
# View histogram of all features again, now including the new datetime features
train.hist()
array([[<AxesSubplot:title={'center':'datetime'}>,
<AxesSubplot:title={'center':'holiday'}>,
<AxesSubplot:title={'center':'workingday'}>],
[<AxesSubplot:title={'center':'temp'}>,
<AxesSubplot:title={'center':'atemp'}>,
<AxesSubplot:title={'center':'humidity'}>],
[<AxesSubplot:title={'center':'windspeed'}>,
<AxesSubplot:title={'center':'count'}>,
<AxesSubplot:title={'center':'year'}>],
[<AxesSubplot:title={'center':'month'}>,
<AxesSubplot:title={'center':'day'}>,
<AxesSubplot:title={'center':'hour'}>]], dtype=object)
%%time
predictor_new_features = TabularPredictor(label='count').fit(train_data=train,
time_limit=600,
presets="best_quality")
No path specified. Models will be saved in: "AutogluonModels/ag-20220404_150818/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20220404_150818/"
AutoGluon Version: 0.4.0
Python Version: 3.7.10
Operating System: Linux
Train Data Rows: 10886
Train Data Columns: 13
Label Column: count
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
Label info (max, min, mean, stddev): (977, 1, 191.57413, 181.14445)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 2160.05 MB
Train Data (Original) Memory Usage: 0.98 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 3 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['season', 'weather']
('datetime', []) : 1 | ['datetime']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 7 | ['holiday', 'workingday', 'humidity', 'year', 'month', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['season', 'weather']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 4 | ['humidity', 'month', 'day', 'hour']
('int', ['bool']) : 3 | ['holiday', 'workingday', 'year']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.2s = Fit runtime
13 features in original data used to generate 17 features in processed data.
Train Data (Processed) Memory Usage: 1.1 MB (0.1% of available memory)
Data preprocessing and feature engineering runtime = 0.27s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ... Training model for up to 399.71s of the 599.72s of remaining time.
-101.5462 = Validation score (root_mean_squared_error)
0.09s = Training runtime
0.11s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ... Training model for up to 399.2s of the 599.2s of remaining time.
-84.1251 = Validation score (root_mean_squared_error)
0.04s = Training runtime
0.1s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ... Training model for up to 398.75s of the 598.75s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-34.346 = Validation score (root_mean_squared_error)
80.82s = Training runtime
9.1s = Validation runtime
Fitting model: LightGBM_BAG_L1 ... Training model for up to 311.52s of the 511.52s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-33.9173 = Validation score (root_mean_squared_error)
41.73s = Training runtime
2.68s = Validation runtime
Fitting model: RandomForestMSE_BAG_L1 ... Training model for up to 266.15s of the 466.15s of remaining time.
-38.3578 = Validation score (root_mean_squared_error)
12.74s = Training runtime
0.47s = Validation runtime
Fitting model: CatBoost_BAG_L1 ... Training model for up to 250.36s of the 450.37s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-33.9543 = Validation score (root_mean_squared_error)
210.16s = Training runtime
0.22s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L1 ... Training model for up to 36.43s of the 236.43s of remaining time.
-38.2024 = Validation score (root_mean_squared_error)
5.41s = Training runtime
0.45s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ... Training model for up to 28.01s of the 228.02s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-85.7848 = Validation score (root_mean_squared_error)
43.25s = Training runtime
0.59s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 181.57s of remaining time.
-32.1333 = Validation score (root_mean_squared_error)
0.74s = Training runtime
0.0s = Validation runtime
Fitting 9 L2 models ...
Fitting model: LightGBMXT_BAG_L2 ... Training model for up to 180.72s of the 180.7s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-31.0591 = Validation score (root_mean_squared_error)
30.77s = Training runtime
1.09s = Validation runtime
Fitting model: LightGBM_BAG_L2 ... Training model for up to 147.04s of the 147.02s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-30.397 = Validation score (root_mean_squared_error)
24.09s = Training runtime
0.35s = Validation runtime
Fitting model: RandomForestMSE_BAG_L2 ... Training model for up to 119.21s of the 119.18s of remaining time.
-31.5705 = Validation score (root_mean_squared_error)
30.71s = Training runtime
0.53s = Validation runtime
Fitting model: CatBoost_BAG_L2 ... Training model for up to 85.5s of the 85.48s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-30.1755 = Validation score (root_mean_squared_error)
78.27s = Training runtime
0.11s = Validation runtime
Fitting model: ExtraTreesMSE_BAG_L2 ... Training model for up to 3.93s of the 3.9s of remaining time.
-31.5413 = Validation score (root_mean_squared_error)
8.84s = Training runtime
0.52s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the -8.18s of remaining time.
-29.97 = Validation score (root_mean_squared_error)
0.38s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 608.86s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220404_150818/")
CPU times: user 2min 1s, sys: 3.55 s, total: 2min 5s
Wall time: 10min 9s
predictor_new_features.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -29.970014 15.298524 527.758593 0.000810 0.380770 3 True 15
1 CatBoost_BAG_L2 -30.175544 13.848426 472.509664 0.113463 78.272600 2 True 13
2 LightGBM_BAG_L2 -30.396965 14.089923 418.330923 0.354959 24.093859 2 True 11
3 LightGBMXT_BAG_L2 -31.059142 14.829292 425.011364 1.094328 30.774301 2 True 10
4 ExtraTreesMSE_BAG_L2 -31.541251 14.257327 403.075078 0.522364 8.838015 2 True 14
5 RandomForestMSE_BAG_L2 -31.570531 14.269784 424.948942 0.534821 30.711879 2 True 12
6 WeightedEnsemble_L2 -32.133311 12.584326 346.234097 0.000913 0.743314 2 True 9
7 LightGBM_BAG_L1 -33.917339 2.683170 41.730785 2.683170 41.730785 1 True 4
8 CatBoost_BAG_L1 -33.954270 0.224599 210.156410 0.224599 210.156410 1 True 6
9 LightGBMXT_BAG_L1 -34.345997 9.101774 80.821197 9.101774 80.821197 1 True 3
10 ExtraTreesMSE_BAG_L1 -38.202438 0.454008 5.414360 0.454008 5.414360 1 True 7
11 RandomForestMSE_BAG_L1 -38.357786 0.469987 12.742048 0.469987 12.742048 1 True 5
12 KNeighborsDist_BAG_L1 -84.125061 0.103884 0.040342 0.103884 0.040342 1 True 2
13 NeuralNetFastAI_BAG_L1 -85.784784 0.590572 43.246233 0.590572 43.246233 1 True 8
14 KNeighborsUnif_BAG_L1 -101.546199 0.106970 0.085688 0.106970 0.085688 1 True 1
Number of models trained: 15
Types of models trained:
{'StackerEnsembleModel_LGB', 'StackerEnsembleModel_CatBoost', 'WeightedEnsembleModel', 'StackerEnsembleModel_RF', 'StackerEnsembleModel_KNN', 'StackerEnsembleModel_NNFastAiTabular', 'StackerEnsembleModel_XT'}
Bagging used: True (with 8 folds)
Multi-layer stack-ensembling used: True (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 2 | ['season', 'weather']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 4 | ['humidity', 'month', 'day', 'hour']
('int', ['bool']) : 3 | ['holiday', 'workingday', 'year']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20220404_150818/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L1': 'StackerEnsembleModel_XT',
'NeuralNetFastAI_BAG_L1': 'StackerEnsembleModel_NNFastAiTabular',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBMXT_BAG_L2': 'StackerEnsembleModel_LGB',
'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
'RandomForestMSE_BAG_L2': 'StackerEnsembleModel_RF',
'CatBoost_BAG_L2': 'StackerEnsembleModel_CatBoost',
'ExtraTreesMSE_BAG_L2': 'StackerEnsembleModel_XT',
'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
'model_performance': {'KNeighborsUnif_BAG_L1': -101.54619908446061,
'KNeighborsDist_BAG_L1': -84.12506123181602,
'LightGBMXT_BAG_L1': -34.34599701170154,
'LightGBM_BAG_L1': -33.91733862651761,
'RandomForestMSE_BAG_L1': -38.35778601783482,
'CatBoost_BAG_L1': -33.95426958035513,
'ExtraTreesMSE_BAG_L1': -38.20243803292602,
'NeuralNetFastAI_BAG_L1': -85.78478385755274,
'WeightedEnsemble_L2': -32.133310507676256,
'LightGBMXT_BAG_L2': -31.059141785451878,
'LightGBM_BAG_L2': -30.396965006922045,
'RandomForestMSE_BAG_L2': -31.570531161894902,
'CatBoost_BAG_L2': -30.17554446800063,
'ExtraTreesMSE_BAG_L2': -31.541251359800743,
'WeightedEnsemble_L3': -29.970014038602915},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'KNeighborsUnif_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/KNeighborsUnif_BAG_L1/',
'KNeighborsDist_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/KNeighborsDist_BAG_L1/',
'LightGBMXT_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/LightGBMXT_BAG_L1/',
'LightGBM_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/LightGBM_BAG_L1/',
'RandomForestMSE_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/RandomForestMSE_BAG_L1/',
'CatBoost_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/CatBoost_BAG_L1/',
'ExtraTreesMSE_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/ExtraTreesMSE_BAG_L1/',
'NeuralNetFastAI_BAG_L1': 'AutogluonModels/ag-20220404_150818/models/NeuralNetFastAI_BAG_L1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20220404_150818/models/WeightedEnsemble_L2/',
'LightGBMXT_BAG_L2': 'AutogluonModels/ag-20220404_150818/models/LightGBMXT_BAG_L2/',
'LightGBM_BAG_L2': 'AutogluonModels/ag-20220404_150818/models/LightGBM_BAG_L2/',
'RandomForestMSE_BAG_L2': 'AutogluonModels/ag-20220404_150818/models/RandomForestMSE_BAG_L2/',
'CatBoost_BAG_L2': 'AutogluonModels/ag-20220404_150818/models/CatBoost_BAG_L2/',
'ExtraTreesMSE_BAG_L2': 'AutogluonModels/ag-20220404_150818/models/ExtraTreesMSE_BAG_L2/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20220404_150818/models/WeightedEnsemble_L3/'},
'model_fit_times': {'KNeighborsUnif_BAG_L1': 0.08568787574768066,
'KNeighborsDist_BAG_L1': 0.04034233093261719,
'LightGBMXT_BAG_L1': 80.82119727134705,
'LightGBM_BAG_L1': 41.73078513145447,
'RandomForestMSE_BAG_L1': 12.742047786712646,
'CatBoost_BAG_L1': 210.15641021728516,
'ExtraTreesMSE_BAG_L1': 5.414360284805298,
'NeuralNetFastAI_BAG_L1': 43.246232748031616,
'WeightedEnsemble_L2': 0.7433137893676758,
'LightGBMXT_BAG_L2': 30.774300575256348,
'LightGBM_BAG_L2': 24.09385919570923,
'RandomForestMSE_BAG_L2': 30.711878776550293,
'CatBoost_BAG_L2': 78.2726001739502,
'ExtraTreesMSE_BAG_L2': 8.838014841079712,
'WeightedEnsemble_L3': 0.3807697296142578},
'model_pred_times': {'KNeighborsUnif_BAG_L1': 0.10696959495544434,
'KNeighborsDist_BAG_L1': 0.10388350486755371,
'LightGBMXT_BAG_L1': 9.101773500442505,
'LightGBM_BAG_L1': 2.6831703186035156,
'RandomForestMSE_BAG_L1': 0.4699873924255371,
'CatBoost_BAG_L1': 0.22459888458251953,
'ExtraTreesMSE_BAG_L1': 0.4540081024169922,
'NeuralNetFastAI_BAG_L1': 0.5905721187591553,
'WeightedEnsemble_L2': 0.0009126663208007812,
'LightGBMXT_BAG_L2': 1.0943284034729004,
'LightGBM_BAG_L2': 0.35495924949645996,
'RandomForestMSE_BAG_L2': 0.5348207950592041,
'CatBoost_BAG_L2': 0.11346268653869629,
'ExtraTreesMSE_BAG_L2': 0.5223636627197266,
'WeightedEnsemble_L3': 0.0008101463317871094},
'num_bag_folds': 8,
'max_stack_level': 3,
'model_hyperparams': {'KNeighborsUnif_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'KNeighborsDist_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'LightGBMXT_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'NeuralNetFastAI_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBMXT_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'RandomForestMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'CatBoost_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'ExtraTreesMSE_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True,
'use_child_oof': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -29.970014 15.298524 527.758593
1 CatBoost_BAG_L2 -30.175544 13.848426 472.509664
2 LightGBM_BAG_L2 -30.396965 14.089923 418.330923
3 LightGBMXT_BAG_L2 -31.059142 14.829292 425.011364
4 ExtraTreesMSE_BAG_L2 -31.541251 14.257327 403.075078
5 RandomForestMSE_BAG_L2 -31.570531 14.269784 424.948942
6 WeightedEnsemble_L2 -32.133311 12.584326 346.234097
7 LightGBM_BAG_L1 -33.917339 2.683170 41.730785
8 CatBoost_BAG_L1 -33.954270 0.224599 210.156410
9 LightGBMXT_BAG_L1 -34.345997 9.101774 80.821197
10 ExtraTreesMSE_BAG_L1 -38.202438 0.454008 5.414360
11 RandomForestMSE_BAG_L1 -38.357786 0.469987 12.742048
12 KNeighborsDist_BAG_L1 -84.125061 0.103884 0.040342
13 NeuralNetFastAI_BAG_L1 -85.784784 0.590572 43.246233
14 KNeighborsUnif_BAG_L1 -101.546199 0.106970 0.085688
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.000810 0.380770 3 True
1 0.113463 78.272600 2 True
2 0.354959 24.093859 2 True
3 1.094328 30.774301 2 True
4 0.522364 8.838015 2 True
5 0.534821 30.711879 2 True
6 0.000913 0.743314 2 True
7 2.683170 41.730785 1 True
8 0.224599 210.156410 1 True
9 9.101774 80.821197 1 True
10 0.454008 5.414360 1 True
11 0.469987 12.742048 1 True
12 0.103884 0.040342 1 True
13 0.590572 43.246233 1 True
14 0.106970 0.085688 1 True
fit_order
0 15
1 13
2 11
3 10
4 14
5 12
6 9
7 4
8 6
9 3
10 7
11 5
12 2
13 8
14 1 }
# Remember to set all negative values to zero
predictions_new_features = predictor_new_features.predict(test)
# Clip any negative regression outputs to zero before submitting
predictions_new_features = predictions_new_features.clip(lower=0)
predictions_new_features
0 16.009563
1 11.774577
2 11.313866
3 9.471573
4 7.828576
...
6488 287.488678
6489 210.481812
6490 151.887497
6491 111.099106
6492 76.658325
Name: count, Length: 6493, dtype: float64
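Because the regressor can emit small negative counts, the predictions are clipped before submission. The per-element `apply` with `np.clip` works, but `Series.clip(lower=0)` does the same thing vectorised; a tiny sketch on made-up values:

```python
import pandas as pd

# Hypothetical predictions, including a negative value the model could emit
preds = pd.Series([-3.2, 0.0, 16.0])

# Series.clip(lower=0) zeroes out negatives without a per-element apply
clipped = preds.clip(lower=0)

print(clipped.tolist())  # [0.0, 0.0, 16.0]
```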
submission_new_features = pd.read_csv('sampleSubmission.csv')
# Same as before: write the predictions into the sample submission
submission_new_features["count"] = predictions_new_features
submission_new_features.to_csv("submission_new_features.csv", index=False)
!kaggle competitions submit -c bike-sharing-demand -f submission_new_features.csv -m "final-new features"
!kaggle competitions submissions -c bike-sharing-demand | tail -n +1 | head -n 6
100%|█████████████████████████████████████████| 243k/243k [00:01<00:00, 154kB/s]
Successfully submitted to Bike Sharing Demand
fileName                     date                 description                                status    publicScore  privateScore
---------------------------  -------------------  -----------------------------------------  --------  -----------  ------------
submission_new_features.csv  2022-04-04 15:22:18  final-new features                         complete  0.69998      0.69998
submission_no_features.csv   2022-04-04 15:05:39  final-no-features                          complete  1.80380      1.80380
submission_new_hpo.csv       2022-04-01 15:23:20  Trained with hyper parameter optimisation  complete  0.52941      0.52941
submission_new_features.csv  2022-03-26 17:21:27  None                                       complete  0.74591      0.74591
0.69998
# Tune the models via the hyperparameters and hyperparameter_tune_kwargs arguments
hyperparameters = {'NN': {'num_epochs': 5}, 'GBM': {'num_boost_round': 30}, 'XGB': {'max_depth': 3}}
predictor_new_hpo = TabularPredictor(label='count').fit(train_data=train,
time_limit=600, presets="best_quality",
hyperparameters=hyperparameters)
No path specified. Models will be saved in: "AutogluonModels/ag-20220404_152831/"
Presets specified: ['best_quality']
Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "AutogluonModels/ag-20220404_152831/"
AutoGluon Version: 0.4.0
Python Version: 3.7.10
Operating System: Linux
Train Data Rows: 10886
Train Data Columns: 13
Label Column: count
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
Label info (max, min, mean, stddev): (977, 1, 191.57413, 181.14445)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 2409.8 MB
Train Data (Original) Memory Usage: 0.98 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 3 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting DatetimeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('category', []) : 2 | ['season', 'weather']
('datetime', []) : 1 | ['datetime']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 7 | ['holiday', 'workingday', 'humidity', 'year', 'month', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 2 | ['season', 'weather']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 4 | ['humidity', 'month', 'day', 'hour']
('int', ['bool']) : 3 | ['holiday', 'workingday', 'year']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
0.4s = Fit runtime
13 features in original data used to generate 17 features in processed data.
Train Data (Processed) Memory Usage: 1.1 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.4s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
To change this, specify the eval_metric parameter of Predictor()
AutoGluon will fit 2 stack levels (L1 to L2) ...
WARNING: "NN" model has been deprecated in v0.4.0 and renamed to "NN_MXNET". Starting in v0.5.0, specifying "NN" or "NN_MXNET" will raise an exception. Consider instead specifying "NN_TORCH".
Fitting 3 L1 models ...
Fitting model: LightGBM_BAG_L1 ... Training model for up to 399.63s of the 599.59s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
2022-04-04 15:28:33,135 WARNING services.py:1758 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=0.81gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
-72.7019 = Validation score (root_mean_squared_error)
14.94s = Training runtime
0.08s = Validation runtime
Fitting model: XGBoost_BAG_L1 ... Training model for up to 380.37s of the 580.33s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-38.192 = Validation score (root_mean_squared_error)
112.41s = Training runtime
2.27s = Validation runtime
Fitting model: NeuralNetMXNet_BAG_L1 ... Training model for up to 264.77s of the 464.73s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-182.6 = Validation score (root_mean_squared_error)
61.89s = Training runtime
2.26s = Validation runtime
Completed 1/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 399.27s of remaining time.
-37.7721 = Validation score (root_mean_squared_error)
0.33s = Training runtime
0.0s = Validation runtime
WARNING: "NN" model has been deprecated in v0.4.0 and renamed to "NN_MXNET". Starting in v0.5.0, specifying "NN" or "NN_MXNET" will raise an exception. Consider instead specifying "NN_TORCH".
Fitting 3 L2 models ...
Fitting model: LightGBM_BAG_L2 ... Training model for up to 398.83s of the 398.81s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-52.8276 = Validation score (root_mean_squared_error)
15.84s = Training runtime
0.11s = Validation runtime
Fitting model: XGBoost_BAG_L2 ... Training model for up to 380.33s of the 380.31s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-35.7484 = Validation score (root_mean_squared_error)
16.61s = Training runtime
0.18s = Validation runtime
Fitting model: NeuralNetMXNet_BAG_L2 ... Training model for up to 360.72s of the 360.7s of remaining time.
Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
-95.6121 = Validation score (root_mean_squared_error)
62.8s = Training runtime
2.39s = Validation runtime
Repeating k-fold bagging: 2/20
Fitting model: LightGBM_BAG_L2 ... Training model for up to 294.95s of the 294.93s of remaining time.
Fitting 8 child models (S2F1 - S2F8) | Fitting with ParallelLocalFoldFittingStrategy
-52.7871 = Validation score (root_mean_squared_error)
31.51s = Training runtime
0.2s = Validation runtime
Fitting model: XGBoost_BAG_L2 ... Training model for up to 275.53s of the 275.51s of remaining time.
Fitting 8 child models (S2F1 - S2F8) | Fitting with ParallelLocalFoldFittingStrategy
-35.5749 = Validation score (root_mean_squared_error)
34.42s = Training runtime
0.34s = Validation runtime
Fitting model: NeuralNetMXNet_BAG_L2 ... Training model for up to 254.09s of the 254.08s of remaining time.
Fitting 8 child models (S2F1 - S2F8) | Fitting with ParallelLocalFoldFittingStrategy
-92.2831 = Validation score (root_mean_squared_error)
125.56s = Training runtime
4.52s = Validation runtime
Repeating k-fold bagging: 3/20
Fitting model: LightGBM_BAG_L2 ... Training model for up to 188.17s of the 188.15s of remaining time.
Fitting 8 child models (S3F1 - S3F8) | Fitting with ParallelLocalFoldFittingStrategy
-52.7733 = Validation score (root_mean_squared_error)
46.89s = Training runtime
0.32s = Validation runtime
Fitting model: XGBoost_BAG_L2 ... Training model for up to 169.85s of the 169.83s of remaining time.
Fitting 8 child models (S3F1 - S3F8) | Fitting with ParallelLocalFoldFittingStrategy
-35.5224 = Validation score (root_mean_squared_error)
50.74s = Training runtime
0.52s = Validation runtime
Fitting model: NeuralNetMXNet_BAG_L2 ... Training model for up to 150.65s of the 150.63s of remaining time.
Fitting 8 child models (S3F1 - S3F8) | Fitting with ParallelLocalFoldFittingStrategy
-92.2923 = Validation score (root_mean_squared_error)
188.56s = Training runtime
6.59s = Validation runtime
Completed 3/20 k-fold bagging repeats ...
Fitting model: WeightedEnsemble_L3 ... Training model for up to 360.0s of the 82.8s of remaining time.
-35.5222 = Validation score (root_mean_squared_error)
0.37s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 517.89s ... Best model: "WeightedEnsemble_L3"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20220404_152831/")
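For reference, the `hyperparameters` argument maps each model key to per-model overrides, and the search could be widened with `hyperparameter_tune_kwargs`. A hedged sketch of both (the trial count and values below are illustrative, not the settings used above; the fit call is shown as a comment because it requires the AutoGluon runtime):

```python
# Per-model hyperparameter overrides passed to TabularPredictor.fit
# (values are illustrative, not recommendations)
hyperparameters = {
    "NN": {"num_epochs": 5},          # MXNet tabular neural net (deprecated key, see warning above)
    "GBM": {"num_boost_round": 30},   # LightGBM
    "XGB": {"max_depth": 3},          # XGBoost
}

# hyperparameter_tune_kwargs would enable an HPO search, e.g.:
# predictor = TabularPredictor(label="count").fit(
#     train_data=train, time_limit=600, presets="best_quality",
#     hyperparameters=hyperparameters,
#     hyperparameter_tune_kwargs={"num_trials": 5, "searcher": "auto", "scheduler": "local"},
# )

print(sorted(hyperparameters))
```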
predictor_new_hpo.fit_summary()
*** Summary of fit() ***
Estimated performance of each model:
model score_val pred_time_val fit_time pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 WeightedEnsemble_L3 -35.522172 11.728385 428.910510 0.004594 0.365865 3 True 8
1 XGBoost_BAG_L2 -35.522366 5.130113 239.987803 0.520401 50.744811 2 True 6
2 WeightedEnsemble_L2 -37.772111 2.348837 127.681848 0.001072 0.329929 2 True 4
3 XGBoost_BAG_L1 -38.192010 2.269653 112.411807 2.269653 112.411807 1 True 2
4 LightGBM_BAG_L2 -52.773341 4.927842 236.133084 0.318130 46.890091 2 True 5
5 LightGBM_BAG_L1 -72.701893 0.078112 14.940112 0.078112 14.940112 1 True 1
6 NeuralNetMXNet_BAG_L2 -92.292336 11.203391 377.799835 6.593679 188.556843 2 True 7
7 NeuralNetMXNet_BAG_L1 -182.600042 2.261947 61.891073 2.261947 61.891073 1 True 3
Number of models trained: 8
Types of models trained:
{'StackerEnsembleModel_XGBoost', 'StackerEnsembleModel_LGB', 'WeightedEnsembleModel', 'StackerEnsembleModel_TabularNeuralNetMxnet'}
Bagging used: True (with 8 folds)
Multi-layer stack-ensembling used: True (with 3 levels)
Feature Metadata (Processed):
(raw dtype, special dtypes):
('category', []) : 2 | ['season', 'weather']
('float', []) : 3 | ['temp', 'atemp', 'windspeed']
('int', []) : 4 | ['humidity', 'month', 'day', 'hour']
('int', ['bool']) : 3 | ['holiday', 'workingday', 'year']
('int', ['datetime_as_int']) : 5 | ['datetime', 'datetime.year', 'datetime.month', 'datetime.day', 'datetime.dayofweek']
Plot summary of models saved to file: AutogluonModels/ag-20220404_152831/SummaryOfModels.html
*** End of fit() summary ***
{'model_types': {'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
'XGBoost_BAG_L1': 'StackerEnsembleModel_XGBoost',
'NeuralNetMXNet_BAG_L1': 'StackerEnsembleModel_TabularNeuralNetMxnet',
'WeightedEnsemble_L2': 'WeightedEnsembleModel',
'LightGBM_BAG_L2': 'StackerEnsembleModel_LGB',
'XGBoost_BAG_L2': 'StackerEnsembleModel_XGBoost',
'NeuralNetMXNet_BAG_L2': 'StackerEnsembleModel_TabularNeuralNetMxnet',
'WeightedEnsemble_L3': 'WeightedEnsembleModel'},
'model_performance': {'LightGBM_BAG_L1': -72.7018929767409,
'XGBoost_BAG_L1': -38.192010391560714,
'NeuralNetMXNet_BAG_L1': -182.60004225470283,
'WeightedEnsemble_L2': -37.77211081260765,
'LightGBM_BAG_L2': -52.77334088447719,
'XGBoost_BAG_L2': -35.52236578300483,
'NeuralNetMXNet_BAG_L2': -92.29233587341277,
'WeightedEnsemble_L3': -35.522172401818025},
'model_best': 'WeightedEnsemble_L3',
'model_paths': {'LightGBM_BAG_L1': 'AutogluonModels/ag-20220404_152831/models/LightGBM_BAG_L1/',
'XGBoost_BAG_L1': 'AutogluonModels/ag-20220404_152831/models/XGBoost_BAG_L1/',
'NeuralNetMXNet_BAG_L1': 'AutogluonModels/ag-20220404_152831/models/NeuralNetMXNet_BAG_L1/',
'WeightedEnsemble_L2': 'AutogluonModels/ag-20220404_152831/models/WeightedEnsemble_L2/',
'LightGBM_BAG_L2': 'AutogluonModels/ag-20220404_152831/models/LightGBM_BAG_L2/',
'XGBoost_BAG_L2': 'AutogluonModels/ag-20220404_152831/models/XGBoost_BAG_L2/',
'NeuralNetMXNet_BAG_L2': 'AutogluonModels/ag-20220404_152831/models/NeuralNetMXNet_BAG_L2/',
'WeightedEnsemble_L3': 'AutogluonModels/ag-20220404_152831/models/WeightedEnsemble_L3/'},
'model_fit_times': {'LightGBM_BAG_L1': 14.940111875534058,
'XGBoost_BAG_L1': 112.41180729866028,
'NeuralNetMXNet_BAG_L1': 61.89107322692871,
'WeightedEnsemble_L2': 0.3299286365509033,
'LightGBM_BAG_L2': 46.89009118080139,
'XGBoost_BAG_L2': 50.744810581207275,
'NeuralNetMXNet_BAG_L2': 188.55684280395508,
'WeightedEnsemble_L3': 0.36586451530456543},
'model_pred_times': {'LightGBM_BAG_L1': 0.07811212539672852,
'XGBoost_BAG_L1': 2.2696526050567627,
'NeuralNetMXNet_BAG_L1': 2.2619473934173584,
'WeightedEnsemble_L2': 0.0010724067687988281,
'LightGBM_BAG_L2': 0.3181295394897461,
'XGBoost_BAG_L2': 0.5204005241394043,
'NeuralNetMXNet_BAG_L2': 6.593678951263428,
'WeightedEnsemble_L3': 0.004593849182128906},
'num_bag_folds': 8,
'max_stack_level': 3,
'model_hyperparams': {'LightGBM_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'XGBoost_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'NeuralNetMXNet_BAG_L1': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L2': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'LightGBM_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'XGBoost_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'NeuralNetMXNet_BAG_L2': {'use_orig_features': True,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True},
'WeightedEnsemble_L3': {'use_orig_features': False,
'max_base_models': 25,
'max_base_models_per_type': 5,
'save_bag_folds': True}},
'leaderboard': model score_val pred_time_val fit_time \
0 WeightedEnsemble_L3 -35.522172 11.728385 428.910510
1 XGBoost_BAG_L2 -35.522366 5.130113 239.987803
2 WeightedEnsemble_L2 -37.772111 2.348837 127.681848
3 XGBoost_BAG_L1 -38.192010 2.269653 112.411807
4 LightGBM_BAG_L2 -52.773341 4.927842 236.133084
5 LightGBM_BAG_L1 -72.701893 0.078112 14.940112
6 NeuralNetMXNet_BAG_L2 -92.292336 11.203391 377.799835
7 NeuralNetMXNet_BAG_L1 -182.600042 2.261947 61.891073
pred_time_val_marginal fit_time_marginal stack_level can_infer \
0 0.004594 0.365865 3 True
1 0.520401 50.744811 2 True
2 0.001072 0.329929 2 True
3 2.269653 112.411807 1 True
4 0.318130 46.890091 2 True
5 0.078112 14.940112 1 True
6 6.593679 188.556843 2 True
7 2.261947 61.891073 1 True
fit_order
0 8
1 6
2 4
3 2
4 5
5 1
6 7
7 3 }
# Remember to set all negative values to zero
predictions_new_hpo = predictor_new_hpo.predict(test)
# Clip any negative regression outputs to zero before submitting
predictions_new_hpo = predictions_new_hpo.clip(lower=0)
submission_new_hpo = pd.read_csv('sampleSubmission.csv')
# Same as before: write the predictions into the sample submission
submission_new_hpo["count"] = predictions_new_hpo
submission_new_hpo.to_csv("submission_new_hpo.csv", index=False)
!kaggle competitions submit -c bike-sharing-demand -f submission_new_hpo.csv -m "final-with-hpo"
100%|█████████████████████████████████████████| 242k/242k [00:01<00:00, 136kB/s] Successfully submitted to Bike Sharing Demand
!kaggle competitions submissions -c bike-sharing-demand | head -n 6
fileName                     date                 description                                status    publicScore  privateScore
---------------------------  -------------------  -----------------------------------------  --------  -----------  ------------
submission_new_hpo.csv       2022-04-04 15:38:44  final-with-hpo                             complete  0.51508      0.51508
submission_new_features.csv  2022-04-04 15:22:18  final-new features                         complete  0.69998      0.69998
submission_no_features.csv   2022-04-04 15:05:39  final-no-features                          complete  1.80380      1.80380
submission_new_hpo.csv       2022-04-01 15:23:20  Trained with hyper parameter optimisation  complete  0.52941      0.52941
0.51508
# Taking the top model score from each training run and creating a line plot to show improvement
# You can create these in the notebook and save them to PNG or use some other tool (e.g. google sheets, excel)
fig = pd.DataFrame(
{
"model": ["initial", "add_features", "hpo"],
"score": [-52.827227 , -29.970014 , -35.522172]
}
).plot(x="model", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_train_score.png')
# Take the 3 Kaggle scores and create a line plot to show improvement
fig = pd.DataFrame(
{
"test_eval": ["initial", "add_features", "hpo"],
"score": [1.80380, 0.69998, 0.51508]
}
).plot(x="test_eval", y="score", figsize=(8, 6)).get_figure()
fig.savefig('model_test_score.png')
hyperparameters = {'NN': {'num_epochs': 5}, 'GBM': {'num_boost_round': 30}, 'XGB':{'max_depth':3}}
# The 3 hyperparameters we tuned with the kaggle score as the result
pd.DataFrame({
"model": ["initial_model", "add_features_model", "hpo_model"],
"hpo1": [np.nan, np.nan, "NN: {num_epochs: 5}"],
"hpo2": [np.nan, np.nan, "GBM: {num_boost_round: 30}"],
"hpo3": [np.nan, np.nan, "XGB':{max_depth:3}"],
"score": [1.80, 0.70, 0.52]
})
| | model | hpo1 | hpo2 | hpo3 | score |
|---|---|---|---|---|---|
| 0 | initial_model | NaN | NaN | NaN | 1.80 |
| 1 | add_features_model | NaN | NaN | NaN | 0.70 |
| 2 | hpo_model | NN: {num_epochs: 5} | GBM: {num_boost_round: 30} | XGB: {max_depth: 3} | 0.52 |
!pip install seaborn
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/site-packages (0.11.2)
import seaborn as sns
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook"
sns.heatmap(train.corr())
<AxesSubplot:>
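The heatmap above visualizes the output of `DataFrame.corr()`, which computes pairwise Pearson correlations between numeric columns. A minimal sketch on a toy frame (column names here are illustrative, not the actual dataset):

```python
import pandas as pd

# Toy frame mimicking a few numeric columns from the bike-sharing data
toy = pd.DataFrame({
    "temp":     [9.8, 13.1, 17.2, 22.6, 28.0],
    "count":    [56, 94, 139, 210, 255],
    "humidity": [81, 76, 70, 62, 58],
})

# corr() returns a symmetric matrix with 1.0 on the diagonal;
# values near +1/-1 indicate strong linear relationships
corr = toy.corr()
print(corr.round(2))
```

In the real data, a strong positive `temp`/`count` correlation is what motivates the temperature-colored bar chart further below.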
We can see two peaks in typical usage: one around 8am and an even stronger one around 5pm/6pm (the commuting rush hours).
fig = px.bar(train.groupby(['hour'])['count'].mean(),title='Average number of bike shares by hour of day')
fig.show()
Between January and June, demand increases month on month, peaks over the summer, and tails off from October.
df_summary = pd.concat([train.groupby('month')['temp'].mean(),\
train.groupby(['month'])['count'].mean()],axis=1)
df_summary.columns = ['average_temp', 'average_shares']  # both columns are monthly means
Bike shares peak in the summer months, the warmest time of the year.
fig_bar = px.bar(df_summary,color='average_temp',title='Average number of bike shares by month, hue denotes average monthly temperature')
fig_bar.show()
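The `df_summary` construction above combines two independent groupby means into one frame via `pd.concat(axis=1)`. A small sketch of the same pattern on toy data (values and column names are illustrative):

```python
import pandas as pd

# Toy data: two months of observations with temperature and ride counts
toy = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "temp":  [8.0, 10.0, 14.0, 16.0, 18.0],
    "count": [40, 60, 100, 120, 140],
})

# Each groupby-mean yields a Series indexed by month; concat along axis=1
# aligns them on that shared index into a single summary frame
summary = pd.concat(
    [toy.groupby("month")["temp"].mean(), toy.groupby("month")["count"].mean()],
    axis=1,
)
summary.columns = ["average_temp", "average_shares"]
print(summary)
```

Because both Series share the `month` index, the columns line up automatically; an equivalent one-liner would be `toy.groupby("month")[["temp", "count"]].mean()`.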